Cepstral Analysis Technique for Automatic Speaker Verification
Abstract - This paper describes new techniques for automatic speaker
verification using telephone speech. The operation of the system is
based on a set of functions of time obtained from acoustic analysis of a
fixed, sentence-long utterance. Cepstrum coefficients are extracted by
means of LPC analysis successively throughout an utterance to form
time functions, and frequency-response distortions introduced by
transmission systems are removed. The time functions are expanded by
orthogonal polynomial representations and, after a feature selection
procedure, brought into time registration with stored reference
functions to calculate the overall distance. This is accomplished by a
new time warping method using a dynamic programming technique. A
decision is made to accept or reject an identity claim, based on the
overall distance. Reference functions and decision thresholds are
updated for each customer.

Several sets of experimental utterances were used for the evaluation 
of the system, which include male and female utterances recorded over 
a conventional telephone connection. Male utterances processed by 
ADPCM and LPC coding systems were used together with unprocessed 
utterances. Results of the experiment indicate that verification error 
rate of one percent or less can be obtained even if the reference and test 
utterances are subjected to different transmission conditions. 
I. Introduction

SPEAKER verification is a process to accept or reject the 
identity claim of a speaker by comparing a set of measure- 
ments of the speaker's utterances with a reference set of mea- 
surements of the utterance of the person whose identity is 
claimed. 

Research on an automatic system for speaker verification at
Bell Laboratories has been reported in previous papers [1]-[4].
The system is based on an acoustic analysis of a fixed,
sentence-long utterance resulting in a function of time or contour
for each feature analyzed. Features selected for analysis
in previous evaluations have included pitch, intensity, the first
three formants, and selected prediction coefficients. The system
which uses pitch and intensity contours has been evaluated
using telephone speech over a period of five months with a test
population of over 100 male and female speakers. The evaluation
indicated an error rate of approximately ten percent for
new customers and approximately five percent for adapted
customers [4]. It has also been shown that the performance
of this system is relatively insensitive to transmission systems
in which the speech is encoded using adaptive differential
pulse code modulation (ADPCM) coding or linear predictive
coding (LPC) vocoding [5].
FURUI: AUTOMATIC SPEAKER VERIFICATION

This paper describes new techniques for an automatic 
speaker verification system for telephone-quality speech. The 
differences between the present implementation and previous 
implementations of the system lie in the features selected for 
analysis and the method of overall distance computation. In
addition, new and enlarged samples of speech, including several
kinds of transmission systems, have been used for evaluation.

II. System Operation 

A block diagram indicating the principal operations of the
system is shown in Fig. 1. There are two inputs to the system:
the identity claim and the sample utterance. The identity
claim, which may be provided by a keyed-in identification
number, causes reference data corresponding to the claim to
be retrieved. The second input is activated by a request to
speak the sample utterance. The recording interval is scanned
to find the endpoints of the utterance. The utterance is then
analyzed. Linear predictor coefficients are extracted
successively, and these coefficients are transformed into cepstrum
coefficients. The cepstrum coefficients are averaged over the
duration of the entire utterance and the average values are
subtracted from the cepstrum coefficients of every frame to
compensate for frequency-response distortions introduced by
the transmission system.

The time functions of the cepstrum coefficients are expanded
by an orthogonal polynomial representation over short
time segments. Then the utterance is represented by the time
functions of coefficients of the orthogonal polynomial
representation. A part of the set of these coefficients is selected for
speaker verification, based on the statistical analysis of the
effectiveness of each coefficient.

A crucial property of the system is automatic time registra- 
tion of the time functions of the sample utterance to the time 
functions retrieved as the reference template of the claimed 
identity. An overall distance between the sample utterance 
and the reference template is obtained as the result of time 
registration using a dynamic programming technique. The dis- 
tance of each element is weighted by intraspeaker variability 
and summed to produce the overall distance. Finally, the 
overall distance is compared with a threshold distance value to 
determine whether the identity claim should be accepted or 
rejected. 

Details concerning the analysis procedures, reference con- 
struction, time registration, and distance calculation will be 
presented in the following sections. 

A. Normalized Cepstrum Extraction 

Fig. 2. Block diagram for cepstrum extraction.

The speech wave is bandlimited from 100 Hz to 3.2 kHz and
sampled at a 6.67 kHz rate, or bandlimited from 100 Hz to
2.6 kHz and sampled at 6 kHz. The digitized speech is then
scanned forward from the beginning of the recording interval
and backward from the end to determine the beginning and
end of the actual sample utterance. The endpoint detection is
accomplished by means of an energy calculation. A
high-emphasis filter ($1 - 0.95z^{-1}$) is applied to the delimited
speech, and a 30 ms Hamming window is applied to the
emphasized speech every 10 ms. First- to tenth-order linear
predictor coefficients are extracted from each frame by the
autocorrelation method. The linear predictor coefficients are
transformed into cepstrum coefficients, using the following
recursive relationships [9]:
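
$$c_1 = a_1$$

$$c_n = \sum_{k=1}^{n-1} \left(1 - \frac{k}{n}\right) a_k c_{n-k} + a_n, \qquad 1 < n \le p \tag{1}$$

where $c_i$ and $a_i$ are the ith-order cepstrum coefficient and
linear predictor coefficient, respectively. Fig. 2 shows the
block diagram of these processes.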
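
The recursion from predictor coefficients to cepstrum coefficients can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `lpc_to_cepstrum` is hypothetical, and the sign convention is an assumption (the all-pole model is taken as $1/(1 - \sum_k a_k z^{-k})$; with the opposite sign convention the signs of the $a_k$ must be flipped first).

```python
def lpc_to_cepstrum(a):
    """Convert predictor coefficients a_1..a_p into cepstrum coefficients c_1..c_p.

    `a` is a list where a[0] holds a_1. Assumes the all-pole model
    1 / (1 - sum_k a_k z^-k); this convention is an assumption of the sketch.
    """
    p = len(a)
    c = []
    for n in range(1, p + 1):
        acc = a[n - 1]                      # the a_n term
        for k in range(1, n):               # sum over k = 1 .. n-1
            acc += (1.0 - k / n) * a[k - 1] * c[n - k - 1]
        c.append(acc)
    return c


# For a single pole at z = 0.5 padded with zeros, c_n should equal 0.5**n / n.
example = lpc_to_cepstrum([0.5, 0.0, 0.0])
```

For a one-pole model the cepstrum is known in closed form ($c_n = a^n / n$), which gives a quick sanity check of the recursion.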
Atal [9] examined several different parametric representations
of speech derived from the linear prediction model for
their effectiveness for automatic recognition of speakers.
Among all the parameters investigated, the cepstrum was
found to be the most effective. It was also pointed out that
cepstrum coefficients have the additional advantage that one
can derive from them a set of parameters which are invariant
to any fixed frequency-response distortion introduced by the
recording apparatus or the transmission system. The new
parameters are obtained simply by subtracting from the cepstrum
coefficients a set of values representing their time averages
over the duration of the entire utterance. This process can
normalize the gross spectral distribution of the utterance, and
it is similar to the inverse filtering process which has been used
in a spoken word recognition system at Bell Laboratories [6].
The normalization technique introduced by Atal is used in the
speaker verification system studied in this paper.

In previous studies by the author [7], [8] it was shown that
this normalization process is also effective in reducing
long-term intraspeaker spectral variability for maintaining high
speaker verification and identification accuracy over a long
period.
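
The normalization described in this section amounts to subtracting the utterance-long average of each cepstrum coefficient from every frame. A minimal sketch, assuming an utterance is stored as a list of frames, each a list of cepstrum coefficients (the function name and data layout are illustrative only):

```python
def normalize_cepstrum(frames):
    """Subtract the time average of each cepstrum coefficient from every frame."""
    n_frames = len(frames)
    dim = len(frames[0])
    means = [sum(frame[i] for frame in frames) / n_frames for i in range(dim)]
    return [[frame[i] - means[i] for i in range(dim)] for frame in frames]


# A fixed channel adds the same offset to the cepstrum of every frame,
# so the normalized coefficients are unchanged by it.
clean = [[1.0, 2.0], [3.0, 6.0]]
shifted = [[c0 + 0.5, c1 - 0.25] for c0, c1 in clean]
```

Because a fixed frequency-response distortion is additive in the cepstral domain, `normalize_cepstrum(clean)` and `normalize_cepstrum(shifted)` give the same result, which is the invariance property the text refers to.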

B. Polynomial Coefficients 

Time functions of the normalized cepstrum coefficients are
expanded by an orthogonal polynomial representation over
90 ms intervals every 10 ms. The 90 ms interval length seemed
adequate for preserving transitional information between
phonemes. The first three orthogonal polynomials are used. They
are [10]

$$P_{0j} = 1, \qquad P_{1j} = j - 5, \qquad P_{2j} = j^2 - 10j + \frac{55}{3}. \tag{2}$$

Thus, if the control function samples for an utterance within
the segment being measured are $x_j$ ($j = 1, 2, \ldots, 9$), then the
first three coefficients of the orthogonal polynomial
representation are

$$a = \frac{1}{9} \sum_{j=1}^{9} x_j, \qquad b = \frac{\sum_{j=1}^{9} x_j P_{1j}}{\sum_{j=1}^{9} P_{1j}^2}, \qquad c = \frac{\sum_{j=1}^{9} x_j P_{2j}}{\sum_{j=1}^{9} P_{2j}^2}. \tag{3}$$

These coefficients represent the mean value, slope, and curvature
of the time function of each cepstrum coefficient in each
segment, respectively.

As the original time functions of cepstrum coefficients are
more efficient than the 0th-order polynomial coefficients for
speaker verification, the original time functions of cepstrum
coefficients are used to replace the 0th-order polynomial
coefficients in this implementation. When the 0th-order
polynomial coefficients are not used, the block diagram
of feature extraction is modified as shown in Fig. 3, since
cepstrum normalization does not affect the first- and
second-order polynomial coefficients.

Fig. 3. Block diagram indicating the principal operations of the system
(modification of Fig. 1).

Accordingly, the utterance is represented by time functions
of the cepstrum coefficients $x_t(i)$, and the first- and
second-order polynomial coefficients, $b_t(i)$ and $c_t(i)$, where t is the
frame number and i is the index of the cepstrum coefficient
($1 \le i \le p$). Since p is set to ten in this system, the result is a
representation by a time function of a 30-dimensional vector.
From these 30 elements, a set of elements which are most
effective in separating the overall distance distributions of
customer and impostor sample utterances is selected for speaker
verification. The selection is made based on the
inter-to-intraspeaker variability ratio for each element:

$$\eta_i = d_{Bi} / d_{Wi}$$

$$d_{Bi} = E_{j,k\,(j \ne k)}\left(d_{ijk}\right), \qquad d_{Wi} = E_{j}\left(d_{ijj}\right), \qquad d_{ijk} = E_{l,m\,(l \ne m \text{ if } j = k)}\left(d_{ijklm}\right) \tag{4}$$

where E means averaging over the indicated indexes, and $d_{ijklm}$ is
the distance between the time functions of the ith element derived
from the lth utterance by speaker j and the mth utterance by speaker
k after time registration.
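
The polynomial expansion over nine-frame (90 ms) segments can be sketched as below. This is an illustration under stated assumptions: it uses the standard discrete orthogonal polynomials on nine equally spaced points (constant, $j - 5$, and $j^2 - 10j + 55/3$), and the function names are hypothetical.

```python
# Orthogonal polynomials sampled at j = 1..9 (assumed standard forms).
J = range(1, 10)
P1 = [j - 5 for j in J]
P2 = [j * j - 10 * j + 55.0 / 3.0 for j in J]


def poly_coeffs(x):
    """Return (mean, slope, curvature) coefficients of a 9-sample segment."""
    if len(x) != 9:
        raise ValueError("segment must contain exactly 9 samples")
    a = sum(x) / 9.0
    b = sum(xj * p for xj, p in zip(x, P1)) / sum(p * p for p in P1)
    c = sum(xj * p for xj, p in zip(x, P2)) / sum(p * p for p in P2)
    return a, b, c


def sliding_poly_coeffs(track, width=9):
    """Apply poly_coeffs to every 9-frame window, advancing one frame (10 ms)."""
    return [poly_coeffs(track[s:s + width])
            for s in range(len(track) - width + 1)]


# A straight line x_j = 2j + 1 has slope 2 and zero curvature.
line = [2 * j + 1 for j in J]
a, b, c = poly_coeffs(line)
```

In the system described here only the slope and curvature tracks would be kept alongside the cepstrum coefficients themselves, giving three values per cepstrum coefficient per frame.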

C. Time Registration

A sample utterance is brought into time registration with the
reference template to calculate the distance between them.
This is accomplished by a new time warping method using a
dynamic programming technique. As there is often some
uncertainty in the location of both the initial and final frames
due to breath noise, etc., the unconstrained endpoint
technique [11] is applied.

We denote two contours as $R(n)$, $1 \le n \le N$, and $T(m)$,
$1 \le m \le M$. We denote $R(n)$ and $T(m)$ as the guide contour and
slave contour, respectively. The purpose of the time warping
algorithm is to provide a mapping between the time indexes n
and m such that a time registration between the two utterances
is obtained. We denote the mapping w between n and m as

$$m = w(n). \tag{5}$$

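The function w must satisfy a set of boundary conditions at
the endpoints of the utterance and some restrictions on the
form it assumes. In our case, the following conditions are
applied:

$$w(n+1) - w(n) = 0, 1, 2 \qquad (w(n) \ne w(n-1)) \tag{6a}$$

$$w(n+1) - w(n) = 1, 2 \qquad (w(n) = w(n-1)) \tag{6b}$$

$$1 \le w(1) \le \delta + 1 \tag{6c}$$

$$M - \delta \le w(N) \le M \tag{6d}$$

$$\max_n w(n) = M, \qquad N - \delta \le n \le N \tag{6e}$$

$$\left| w(n) - \frac{M}{N}\, n \right| \le m_0 \tag{6f}$$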

Equations (6a) and (6b) require that $w(n)$ be monotonically
increasing, with a maximum slope of two, and a minimum
slope of 1/2. The minimum slope constraint is a consequence
of the prohibition against two consecutive steps with slope 0.
In (6c), (6d), and (6e), $\delta$ represents the maximum anticipated
range of mismatch (in frames) between boundary points of the
two utterances. In our case, a value of $\delta$ of 15 (frames) was
used, representing a 150 ms region in which the initial and final
frames could be mapped.

The warping function can reach the final boundary of the
slave contour prior to the last frame, i.e., it is possible that

$$w(n) = M \qquad \text{for } n < N \tag{7}$$

in which case it is not physically meaningful to continue the
path.

Equation (6f) restricts the warping function to a region
around the diagonal line which connects the (1, 1) and
(N, M) points on the (n, m) plane. In our case $m_0$ was set to
W (frames). From these conditions the warping function is
constrained to follow a path inside the shaded region of Fig. 4.
The vertices of the labeled points A and B are obtained as the
intersections of the lines

$$\text{point A:}\quad \begin{cases} m - M = \tfrac{1}{2}(n - N + \delta) \\ m - \delta - 1 = 2(n - 1) \end{cases} \qquad \text{point B:}\quad \begin{cases} m - 1 = \tfrac{1}{2}(n - 1) \\ m - M + \delta = 2(n - N) \end{cases} \tag{8}$$

As can be seen in Fig. 4, the warping function can start from
any frame of the slave contour between the first and ($\delta$ + 1)th
frame, but it must start from the first frame of the guide
contour. If we allow the warping function to also start from any
frame of the guide contour, the computation time becomes
excessive. If the intraspeaker variation of utterance lengths is
not large, we have found that the likelihood is great that the
optimum warping function starts from the first frame of the
shorter of either the reference template or test utterance. The
basis for this result is described in Section IV-F. Based on this
assumption we adopted the procedure of using as the guide
contour the shorter of either the reference template or test
utterance. This means that the longer one is mapped to the
slave contour axis, which is the ordinate of the warping plane,
and the shorter one is mapped to the guide contour axis, which
is the abscissa of the warping plane in Fig. 4.

A complete specification of the warping function results
from a point-by-point measure of similarity between the guide
contour $R(n)$ and the slave contour $T(m)$.

D. Distance Measure

A similarity measure or distance function D must be defined
for every pair of points (n, m) within the shaded region of
Fig. 4. Given the distance function D, the optimum dynamic
path w is chosen to minimize the accumulated distance $D_T$
along the path, i.e.,

$$D_T = \min_{w(n)} \sum_{n=1}^{N} D\left(R(n), T(w(n))\right). \tag{9}$$



When the warping function reaches the final boundary of the
slave contour prior to the last frame, the accumulated distance
$D_T$ is scaled by the factor $(N/N_s)$, where $N_s$ is the frame at
which (7) is satisfied, so as to equalize the number of distances
which enter into the total distance $D_T$. The optimum path w
is easily determined by the method of dynamic programming.

Let us denote the feature vector of the nth frame of the
guide contour as $R(n) = (r_1(n), r_2(n), \ldots, r_i(n), \ldots, r_K(n))$
and that of the mth frame of the slave contour as
$T(m) = (t_1(m), t_2(m), \ldots, t_i(m), \ldots, t_K(m))$, where K is the
number of the elements of the feature vector. In this paper, two
kinds of distance measures are used and evaluated:

$$D_1(R(n), T(m)) = \sum_{i=1}^{K} g_i \left(r_i(n) - t_i(m)\right)^2 \tag{10a}$$

$$D_2(R(n), T(m)) = \sum_{i=1}^{K} g_i \left| r_i(n) - t_i(m) \right| \tag{10b}$$

where $g_i$ is the weighting function, which is the reciprocal of
the mean value of intraspeaker variability for the ith element,
defined as follows:

$$d_{Wi} = E_{j}\left(d_{ijj}\right) = E_{j,k,l\,(k \ne l)} \sum_{n=1}^{N} \left(f_{ijk}(n) - f_{ijl}(w(n))\right)^2 \tag{11}$$

where $f_{ijk}(n)$ is the nth frame of the kth utterance by
speaker j.
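
The time registration and distance accumulation can be sketched with a small dynamic programming routine. This is a simplified illustration, not the method as implemented in the paper: it enforces the step constraints (steps of 0, 1, or 2 slave frames per guide frame, with no two consecutive zero steps) and the relaxed start and end regions of width delta, but it omits the diagonal band restriction and the rescaling applied when the path reaches the end of the slave contour early. The function names and the weighted squared distance used here are assumptions of the sketch.

```python
import math


def frame_distance(r, t, g):
    """Weighted squared distance between two feature vectors."""
    return sum(gi * (ri - ti) ** 2 for gi, ri, ti in zip(g, r, t))


def dtw_distance(guide, slave, g, delta=15):
    """Minimum accumulated distance over warping paths (simplified sketch).

    State [m][z]: best cost ending at slave frame m for the current guide
    frame; z = 1 means the last step had slope 0, so another 0 is forbidden.
    """
    n_guide, n_slave = len(guide), len(slave)
    inf = math.inf
    prev = [[inf, inf] for _ in range(n_slave)]
    # The path may start at any of the first delta + 1 slave frames.
    for m in range(min(delta + 1, n_slave)):
        prev[m][0] = frame_distance(guide[0], slave[m], g)
    for n in range(1, n_guide):
        cur = [[inf, inf] for _ in range(n_slave)]
        for m in range(n_slave):
            d = frame_distance(guide[n], slave[m], g)
            # Step of 0: only allowed if the previous step was not 0.
            if prev[m][0] < inf:
                cur[m][1] = prev[m][0] + d
            # Steps of 1 or 2: allowed from either previous state.
            best = inf
            for step in (1, 2):
                if m - step >= 0:
                    best = min(best, prev[m - step][0], prev[m - step][1])
            if best < inf:
                cur[m][0] = best + d
        prev = cur
    # The path may end at any of the last delta + 1 slave frames.
    lo = max(0, n_slave - 1 - delta)
    return min(min(prev[m]) for m in range(lo, n_slave))


# A slave contour that is a 2x time-stretched copy of the guide aligns exactly.
guide = [[0.0], [2.0], [4.0]]
slave = [[0.0], [1.0], [2.0], [3.0], [4.0]]
stretch_cost = dtw_distance(guide, slave, g=[1.0], delta=1)
```

Following the convention described in the text, the shorter of the two utterances would be passed as `guide`. The routine returns `math.inf` when no path satisfies the constraints.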

E. Decision Threshold 

The overall distance accumulated over the optimum warping 
function is compared with a threshold to determine whether 
to accept or reject an identity claim. In many kinds of speaker 
verification experiments, the threshold is set a posteriori so 
that the two kinds of error rate (the rate of rejecting utterances 
which should be accepted and the rate of accepting utterances 
which should be rejected) are equal. But these experiments 
are unrealistic, and procedures for setting thresholds in ad- 
vance in practical situations are not well established. 

In this paper, two methods for setting an a priori threshold 
are evaluated. In the first method, the threshold is set to an 
experimentally decided fixed value, and the same threshold is 
used for all customers. In the second method, the optimum 
threshold is estimated based on the distribution of overall dis- 
tances between each customer's reference template and a set 
of utterances of other speakers. In the latter case, the thresh- 
old is updated at the same time as the reference template up- 
dating, based on the distribution of interspeaker distances. 
The following equation, based on empirical results, is used to 
set the threshold for each customer: 



(12) 



where $\theta(k)$ is the threshold for customer k, $\mu_{DB}(k)$ and
$\sigma_{DB}(k)$ are the mean value and standard deviation of the
distribution of interspeaker distance, respectively, and a and b are
constant parameters which are set experimentally, the same values
being used for all customers and for all data sets.

Fig. 5 shows an example of typical intraspeaker and
interspeaker distance distributions. Equation (12) indicates that
as the mean value of the interspeaker distance becomes larger
and the standard deviation becomes smaller, the decision
threshold becomes larger. The intraspeaker distance
distribution is not taken into account in the calculation of the decision
threshold. There are two reasons for this. First, the
intraspeaker distance distributions are fairly uniform from speaker
to speaker. Second, it is difficult to obtain stable estimates of
the distribution of intraspeaker distance for small numbers of
training utterances, whereas it is easy to estimate interspeaker
distance distributions by cross comparison of the training
utterances between different customers.

A posteriori equal-error decision thresholds were also used
to compare the results with those obtained using a priori
thresholds.

Fig. 5. Example of typical intraspeaker and interspeaker distance
distributions.
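
The per-customer threshold estimate can be sketched as below. The linear form used here, threshold = a * (mean - standard deviation) + b, is an assumption: it is one form consistent with the behavior described above (a larger interspeaker mean and a smaller standard deviation both raise the threshold), and is not quoted from the text. The function names and the values of `a` and `b` are hypothetical.

```python
import statistics


def estimate_threshold(interspeaker_distances, a, b):
    """Estimate a customer's threshold from distances to other speakers.

    Assumed form: a * (mean - std) + b. The population standard deviation
    is used here; that choice is also an assumption of the sketch.
    """
    mu = statistics.mean(interspeaker_distances)
    sigma = statistics.pstdev(interspeaker_distances)
    return a * (mu - sigma) + b


def accept_claim(overall_distance, threshold):
    """Accept the identity claim when the overall distance is small enough."""
    return overall_distance <= threshold


theta = estimate_threshold([2.0, 4.0], a=1.0, b=0.0)
```

Only interspeaker distances enter the estimate, matching the observation that intraspeaker distributions are hard to estimate reliably from a few training utterances.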

F. Reference Construction

The establishment and updating of reference information is 
another important element of the system. For each kind of 
data set, three or five utterances were used to construct a 
reference template for each customer. Two methods of refer- 
ence updating were observed. In the first method, the refer- 
ence template was updated every seventh access by the cus- 
tomer using his latest utterances (method 1). In the second 
method, it was updated each time the system was accessed by 
the customer (method 2). The procedure for establishing the 
initial reference template is the following. The first training 
utterance is used as a basic utterance, to which the second is 
brought into time registration. After registration the time 
functions of the feature parameters of the first two utterances 
are averaged and the third is brought into time registration 
with the averaged function and then averaged into it. When
five utterances are used to construct the reference template,
the fourth and fifth utterances are also brought into time
registration and included in the averaging.

The training utterances are also used for the calculation of 
the weighting function which is used in the distance measure 
of (10a) and (10b), the interspeaker to intraspeaker variability 
ratio of (4) which is used in feature selection, and the inter- 
speaker distance distribution which is used to set the decision 
threshold using (12). 
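
The construction of the initial reference template can be sketched as a running average over time-registered utterances. Two things in this sketch are assumptions: the equal-weight running mean (the text says only that each new utterance is "averaged into" the template), and the alignment function `align_linear`, which is a simple stand-in for the dynamic programming time registration and is included only to keep the example self-contained.

```python
def align_linear(utterance, length):
    """Stand-in for time registration: resample frame indexes linearly."""
    if length == 1:
        return [utterance[0]]
    n = len(utterance)
    return [utterance[round(i * (n - 1) / (length - 1))] for i in range(length)]


def build_reference(utterances, align=align_linear):
    """Average training utterances into one template, one utterance at a time.

    The first utterance is the basic utterance; each later one is aligned
    to the current template and folded in with an equal-weight running mean.
    """
    reference = [list(frame) for frame in utterances[0]]
    for count, utterance in enumerate(utterances[1:], start=1):
        warped = align(utterance, len(reference))
        reference = [
            [(count * r + w) / (count + 1) for r, w in zip(ref_frame, new_frame)]
            for ref_frame, new_frame in zip(reference, warped)
        ]
    return reference


training = [
    [[0.0], [2.0]],
    [[3.0], [5.0]],
    [[6.0], [8.0]],
]
template = build_reference(training)
```

In a fuller implementation the `align` argument would be replaced by a function that warps each utterance along the optimum path found by the time registration procedure.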

III. Sample Utterances 
Several kinds of utterance sets were used to evaluate this
system. Fig. 6 is a block diagram which shows the procedures
used to create the utterance sets. The speech was uttered in a
sound booth and recorded over conventional dialed-up
telephone lines or a high-quality microphone. The signal was
bandlimited from 100 to 3200 Hz, which is the nominal
telephone bandwidth. The telephone speech was processed by
the following three transmission systems:

1) clear channel, i.e., no additional processing,

2) adaptive differential pulse code modulation (ADPCM)
coding,

3) linear predictive vocoding (LPC).

Fig. 6. Block diagram indicating the procedure to make the utterance
sets.

The ADPCM coder used in this experiment was a simulation
of the coder built by Bates [12], based on the work of
Cummiskey et al. [13]. Fig. 7 shows a block diagram of the
ADPCM system. Since the required sampling rate for the
ADPCM coder was 6 kHz, a sampling rate conversion system
was used to convert it from 10 kHz to 6 kHz at the input to
the coder [14]. The signal bandwidth was reduced to 2.6 kHz
for the ADPCM coder in the sampling rate conversion system.
In the coder, a 4-bit adaptive quantizer was used to code the
differential signal, giving an overall bit rate of 24 kbits/s for the
coder [13].

Fig. 7. Block diagram of the ADPCM system.

A block diagram of the LPC vocoder is given in Fig. 8. The
implementation was based on the autocorrelation method of
linear prediction [15], [16]. Pitch detection and
voiced-unvoiced decision were performed using the modified
autocorrelation pitch detector of Dubnowski et al. [17]. A 12-pole
LPC analysis (p = 12) was performed using a pitch-adaptive
variable frame size, at a rate of 100 frames per second [18].
No quantization of the LPC parameters was used in this
experiment.

Fig. 8. Block diagram of the LPC system.

The sampling rate of the signal passed through the LPC
vocoder or the clear channel was converted from 10 kHz to
6.67 kHz, or if bandlimited to 2.6 kHz, converted to 6 kHz.

TABLE I
Utterance Sets Used in Experiments

The telephone speech utterance set includes the following. 

1) 50 recordings made by each of 10 male and 10 female
speakers over a period of two months. The first 10 recordings
were made once a day; the remaining 40 were made twice a
day (morning and afternoon). These speakers were designated
"customers."

2) One recording made by each of 40 male and 40 female 
naive speakers. These speakers were designated "impostors." 
There was no attempt to mimic the "customers." 

The speech recorded over a high-quality microphone was
bandlimited from 100 to 3200 Hz and sampled at 6.67 kHz.
The high-quality speech utterance set includes the following.

1) 26 recordings made by each of 21 male customers over 
a period of two months. Each was recorded on a different day. 

2) One recording made by each of 55 male impostors with 
no attempt to mimic the customers. 

Two all-voiced sentences were used in the recordings. The 
males used the sentence, "We were away a year ago" and the 
females used the sentence, "I know when my lawyer is due." 
Table I summarizes the six kinds of utterance sets used in this 
experiment. All the low-pass filters of 3.2 kHz and 2.6 kHz
are digital filters, except that the 2.6 kHz low-pass filter
applied to ADPCM speech is an analog hardware filter.
Fig. 9. Interspeaker to intraspeaker distance ratio for the first ten
utterances by five speakers each extracted from utterance set (1).

IV. Experimental Results 
A. Results for Utterance Set (1)

The first experiment was performed using utterance set (1),
which is a set of utterances by ten male customers and 40
male impostors recorded over a conventional telephone
connection, transferred through a clear channel and sampled at
6.67 kHz. The distance measure $D_1$ is used in the experiments
in Sections IV-A through IV-D. The cepstrum normalization
technique using the averaged value of the cepstrum is not applied
in the experiments in these sections.

1) Distance ratio: In order to evaluate the feature parame- 
ters from the viewpoint of their effectiveness for speaker veri- 
fication, the ratio of the average value of interspeaker distance 
to the average value of intraspeaker distance defined by (4), 
was calculated for each parameter. 

Fig. 9 shows the results for an utterance set which uses the 
first 10 utterances by five customers each. Fig. 10 shows the 
results for an utterance set which comprises the middle 10 
utterances by five customers each. It can be seen from these
figures that all of these parameters have distance ratios greater
than one, which means that all of them are useful for
distinguishing speakers. The original cepstrum coefficients are
generally most effective, and the higher the order of the
polynomial coefficients, the less effective they become,
irrespective of the order of the cepstrum.

A preliminary experiment indicated that using utterances 
from ten customers in the feature selection process produced 
no improvement in speaker verification accuracy over the 
procedure described here using five customers. 
Fig. 10. Interspeaker to intraspeaker distance ratio for the middle ten
utterances by five speakers each extracted from utterance set (1).

TABLE II

Average Error Rates. Utterance Set: No. (1). FR: False Rejection
(False Alarm). FA: False Acceptance (Miss Rate).

Threshold                    FR       FA       (FR + FA)/2
A Priori, Estimated          0.29%    0.08%    0.19%
A Priori, Fixed              0.29%    0.31%    0.30%
A Posteriori (equal error)   0%



Based on these results, 18 parameters which have a relatively 
large distance ratio were selected. These are all ten cepstrum 
coefficients and all the first-order polynomial coefficients ex- 
cept coefficient index numbers 5 and 8. None of the second- 
order polynomial coefficients were included in this selected 
parameter set. The choice of 18 for the number of selected 
parameters was decided arbitrarily. 
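
The selection step can be sketched as ranking the 30 candidate elements by their interspeaker-to-intraspeaker distance ratio and keeping the largest ones. The function name and the toy distance values below are illustrative only; the count of 18 comes from the text.

```python
def select_features(d_between, d_within, n_select=18):
    """Return indexes of the n_select elements with the largest ratio.

    d_between[i] and d_within[i] are the mean interspeaker and intraspeaker
    distances of element i; the ratio list is returned for inspection.
    """
    ratios = [b / w for b, w in zip(d_between, d_within)]
    ranked = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    return sorted(ranked[:n_select]), ratios


# Toy example with four elements, keeping the best two.
chosen, ratios = select_features([4.0, 9.0, 2.0, 8.0], [2.0, 3.0, 2.0, 2.0],
                                 n_select=2)
```

A ratio above one means that an element separates different speakers more than it varies within one speaker, which is why every element with a ratio above one is potentially useful.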

2) Speaker verification: Table II shows the mean error rate
of speaker verification when five utterances were used to
establish an initial reference template which was updated every
seventh utterance by each customer. The mean interval
between training and test utterances is nearly six days. Three
types of decision thresholds were applied. The error rates were
averaged over ten customers and presented in this table. When
the threshold is set a posteriori, the error rate is zero.
When the threshold is set a priori, the mean error rate of
false acceptance and false rejection can be made as small as
0.19 percent using the optimum threshold estimation
technique presented in Section II-E. When the threshold is fixed
to a value which is common to all customers, the mean error
rate increases to 0.30 percent. These results show that the
speaker verification techniques proposed in this paper are very
powerful for telephone speech.

TABLE III

Average Error Rates. Utterance Set: No. (6). FR: False Rejection
(False Alarm). FA: False Acceptance (Miss Rate).

Threshold                    FR       FA       (FR + FA)/2
A Priori, Estimated          0.29%    0.43%    0.36%
A Priori, Fixed              0.28%    0.54%    0.42%
A Posteriori (equal error)   0.06%
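
The error rates reported in the tables can be computed from lists of overall distances as sketched below. The function names and the sample distances are illustrative, and the threshold search shown for the a posteriori case (scanning the observed distances for the value that makes the two error rates closest) is an assumed procedure, not one described in the text.

```python
def error_rates(customer_distances, impostor_distances, threshold):
    """Return (FR, FA, mean) as fractions for a given decision threshold."""
    fr = sum(d > threshold for d in customer_distances) / len(customer_distances)
    fa = sum(d <= threshold for d in impostor_distances) / len(impostor_distances)
    return fr, fa, (fr + fa) / 2


def equal_error_threshold(customer_distances, impostor_distances):
    """Pick the observed distance at which FR and FA are closest."""
    candidates = sorted(set(customer_distances) | set(impostor_distances))

    def gap(threshold):
        fr, fa, _ = error_rates(customer_distances, impostor_distances, threshold)
        return abs(fr - fa)

    return min(candidates, key=gap)


customers = [1.0, 2.0, 5.0, 1.5]
impostors = [4.0, 6.0, 7.0, 8.0]
fr, fa, mean_error = error_rates(customers, impostors, threshold=4.5)
```

With these toy values one customer utterance is rejected and one impostor utterance is accepted at a threshold of 4.5, so FR and FA are both 0.25.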

3) Effects of time interval between training and test
utterances: Intersession variability for a given speaker is one
of the most important problems in speaker verification [7],
[8]. In order to check the effect of the time interval between
training and test utterances on speaker verification accuracy,
this interval was varied from six days to six weeks by comparing
test utterances with reference templates constructed at times
corresponding to the specified intervals. In this experiment, 14
utterances by each of ten customers were used as test inputs, and
five utterances were used to construct each reference
template. The experimental results indicated that verification
accuracy is not affected by time intervals between training and
input utterances of up to six weeks.

B. Results for Utterance Set (6)

The second experiment was performed using utterance set 
(6) which comprises the utterances by ten female customers 
and 40 female impostors recorded over a conventional tele- 
phone connection. 

The ratio of interspeaker distance to intraspeaker distance 
for each parameter was calculated using the first ten utterances 
by five customers each. The result was similar to the result for 
male speakers shown in Fig. 9. The original cepstrum
coefficients are most effective and the second-order polynomial
coefficients are less effective than the first-order ones. This
result was used for the selection of 18 parameters, which include
all ten cepstrum coefficients and all the first-order polynomial
coefficients except coefficient index numbers 4 and 9.

Table III presents the speaker verification results under the 
same conditions observed for the male speaker set described in 
the previous section; a reference file was constructed using 
five utterances and updated every seventh access by each
customer. Although the error rates for the female utterance set
are slightly larger than those for the male utterance set, they
remain below 0.6 percent for both a priori thresholds.

Results for a speaker verification experiment in which a
reference template was constructed using five utterances and
the time interval between training and test utterances was
varied up to six weeks were quite similar to those of the male
speaker set. There was no significant increase in error rate
when the interval was extended to six weeks.
Fig. 11. Two updating procedures used for utterance set (5):
(a) Method 1; (b) Method 2.
In Method 1, a reference template is updated every seventh access by
the customer. In Method 2, a reference template is updated each time
the system is accessed by the customer. In both methods, the latest
five utterances are used to update the reference template.



C. Results for Utterance Set (5) 

In the third experiment, utterance set (5) which comprises 
26 utterances by 21 male customers each and a single utter- 
ance by 55 impostors recorded over a high-quality microphone 
was used to test the speaker verification system. In this case 
the 18 selected parameters include the first nine cepstrum co- 
efficients and the first nine first-order polynomial coefficients. 

Fig. 11(a) shows the time relation between training and test 
utterances in speaker verification experiments for the condi- 
tion that five utterances were used to construct a reference 
template for each customer and that it was updated every 
seventh access by the customer. Table IV shows error rates 
averaged over 21 customers. Results of the first, middle, and 
last seven input utterances are averaged separately. False re- 
jection error is very large for the first seven input utterances. 
Initial variability like this was also shown in the previous ex- 
periment by Rosenberg [4] . 

In order to improve the results for the first seven input
utterances, the second method for reference updating was
introduced. The reference template was updated at each access by
the customer using the latest five utterances, as shown in
Fig. 11(b). Table V shows the results of the verification
experiment using this method. In this case, the error rate for
the first seven input utterances is not significantly larger than
those for the middle and last seven utterances. Comparing with
Table IV, it can also be concluded that frequent updating of the
reference template is quite effective for the first several input
utterances but is not necessary for the remaining utterances. If
we apply the second reference updating method to the first seven
input utterances and the first reference updating method to the
remaining utterances, we can achieve a verification error rate of
less than one percent using the a priori estimated threshold.
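
The two updating schedules, and the combination suggested at the end of this section, can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the class name `ReferenceUpdater`, its parameters, and the use of a plain element-wise mean over equal-length feature vectors are assumptions made for the example. The templates in this paper are time functions, so a real update would presumably require time registration before averaging.

```python
from collections import deque
from typing import Deque, List, Sequence

Template = List[float]


def mean_template(utterances: Sequence[Sequence[float]]) -> Template:
    """Average equally long feature vectors element by element (assumed rule)."""
    n = len(utterances)
    return [sum(values) / n for values in zip(*utterances)]


class ReferenceUpdater:
    """Hypothetical bookkeeping for the two reference updating methods.

    The first `frequent_until` accesses trigger an update after every access
    (Method 2); later accesses trigger an update every `period` accesses
    (Method 1). Setting frequent_until=0 gives Method 1 alone.
    """

    def __init__(self, training: Sequence[Sequence[float]],
                 frequent_until: int = 7, period: int = 7,
                 history: int = 5) -> None:
        # Only the latest `history` utterances are kept for updating.
        self.latest: Deque[Sequence[float]] = deque(training, maxlen=history)
        self.template = mean_template(list(self.latest))
        self.frequent_until = frequent_until
        self.period = period
        self.accesses = 0
        self.update_log: List[int] = []

    def accept(self, utterance: Sequence[float]) -> None:
        """Record an accepted customer utterance and update when scheduled."""
        self.accesses += 1
        self.latest.append(utterance)
        scheduled = (self.accesses <= self.frequent_until
                     or self.accesses % self.period == 0)
        if scheduled:
            self.template = mean_template(list(self.latest))
            self.update_log.append(self.accesses)
```

With the default arguments, 21 accesses produce updates after accesses 1 through 7 and then after accesses 14 and 21, which is the combined schedule described above.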
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TABLE IV 

Average Error Rates. Utterance Set: No. (5). Reference Updating:
Method 1. FR: False Rejection (False Alarm). FA: False
Acceptance (Miss Rate).

Threshold                 Error          Utterances
                                         First     Middle    Last
A Priori   Estimated      FR             4.76%     0.68%     0.68%
                          FA             0.68%     0.73%     0.98%
                          (FR + FA)/2    2.72%     0.71%     0.83%
A Priori   Fixed          FR             9.52%     0.66%     2.04%
                          FA             0.66%     0.66%     0.88%
                          (FR + FA)/2    5.09%     0.67%     1.46%
A Posteriori   Equal                     1.17%     0.73%     0.24%


TABLE V

Average Error Rates. Utterance Set: No. (5). Reference Updating:
Method 2. FR: False Rejection (False Alarm). FA: False
Acceptance (Miss Rate).

Threshold                 Error          Utterances
                                         First     Middle    Last
A Priori   Estimated      FR             0.66%     0.68%     0%
                          FA             0.86%     0.60%     0.60%
                          (FR + FA)/2    0.77%     0.64%     0.30%
A Priori   Fixed          FR             1.36%     0.68%     0.68%
                          FA             0.56%     0.74%     0.57%
                          (FR + FA)/2    0.96%     0.71%     0.63%
A Posteriori   Equal                     0.94%     0.17%     0.34%


D. Results for Utterance Set (3)

In the fourth experiment, utterance set (3), which comprises
50 utterances by ten male customers each and a single utterance
by 40 impostors, recorded over a conventional telephone
connection and transformed by a 24 kbit/s ADPCM system, was used
to evaluate the speaker verification system. In this case, the
18 selected parameters include all ten cepstrum coefficients and
all the first-order polynomial coefficients except coefficients
with index numbers 1 and 2.

Table VI shows the results of speaker verification experiments
when an initial reference template for each customer was
constructed using five utterances and updated every seventh
access. Although the error rates are slightly larger than those
obtained for clean speech presented in Table II, both false
rejection and false acceptance are still less than one percent
even when the decision threshold is set a priori by (12).
Speaker verification results showing the effect of extending the
interval between training and test utterances up to six weeks
indicated that the error rate was slightly greater than that
obtained for clean speech.


TABLE VI

Average Error Rates. Utterance Set: No. (3). FR: False Rejection
(False Alarm). FA: False Acceptance (Miss Rate).

Threshold                 FR        FA        (FR + FA)/2
A Priori   Estimated      0.29%     0.83%     0.56%
A Priori   Fixed          1.43%     1.46%     1.45%
A Posteriori   Equal      0.04%


TABLE VII

Average Error Rates by Estimated a priori Threshold.

Utterance set        No. (1)      No. (6)      No. (5)       No. (3)
Customers            10 Male      10 Female    21 Male       10 Male
Impostors            40 Male      40 Female    55 Male       40 Male
Transmission         Telephone    Telephone    Microphone    Telephone,
                                                             24 kb/s ADPCM
False Rejection
(False Alarm)        0.29%        0.29%        0.68%         0.29%
False Acceptance
(Miss Rate)          0.08%        0.43%        0.86%         0.83%
Average              0.19%        0.36%        0.77%         0.56%
Number of Trials     5,500        5,500        12,726        5,500


Table VII shows the summary of the results of speaker veri- 
fication experiments for utterance sets (1), (3), (5), and (6), 
using the a priori threshold specified by (12). For utterance 
set (5), reference templates were updated following each access 
for the first seven test utterances using the latest five utter- 
ances. After the seventh utterance, updating was carried out 
every seventh access. For all other utterance sets updating was 
carried out only after each seventh access. For all utterance 
sets except (5) there were 35 customer test utterances and 515 
impostor test utterances per customer for a total of 350 cus- 
tomer and 5150 impostor trials, respectively. For utterance 
set (5) there were 21 customer test utterances and 585 im- 
postor test utterances per customer for a total of 441 customer 
and 12 285 impostor trials, respectively.

Although this table indicates a higher error rate for micro- 
phone speech than for telephone speech, the difference is sta- 
tistically insignificant since the number of utterances which 
caused the verification error is very small. 
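
The trial counts quoted above are simple products, and the small number of errors behind the percentages can be made explicit. The check below is only arithmetic on figures taken from the text and from Table VII; the function names are illustrative, and the implied error counts are rounded estimates rather than reported values.

```python
from typing import Tuple


def trial_counts(customers: int, customer_tests: int,
                 impostor_tests: int) -> Tuple[int, int, int]:
    """Return (customer trials, impostor trials, total trials)."""
    customer_trials = customers * customer_tests
    impostor_trials = customers * impostor_tests
    return customer_trials, impostor_trials, customer_trials + impostor_trials


def implied_errors(rate_percent: float, trials: int) -> int:
    """Nearest whole number of errors consistent with a quoted percentage."""
    return round(rate_percent / 100.0 * trials)


# Utterance sets (1), (3), and (6): 10 customers, 35 + 515 tests each.
telephone = trial_counts(10, 35, 515)
# Utterance set (5): 21 customers, 21 + 585 tests each.
microphone = trial_counts(21, 21, 585)

# False rejections implied by the Table VII rates.
fr_set5 = implied_errors(0.68, microphone[0])
fr_set1 = implied_errors(0.29, telephone[0])
```

Under these figures the false rejection rates for sets (5) and (1) correspond to roughly three and one rejected customer utterances, respectively, which illustrates why the difference between the two conditions rests on very few utterances.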
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V. Experiments Using Mixed Transmission Systems 
A. Experimental Design 

In order to investigate the effects of several transmission
systems on the speaker verification system more thoroughly, the
reference and test utterances were subjected to different
transmission systems and evaluated using the same techniques used
in previous experiments with homogeneous transmission conditions.
One difference between the techniques used in previous
experiments and this experiment is that the transmission
characteristics normalization method described in Section II-A,
which subtracts the time averages from the cepstrum coefficients,
was applied to all utterances in this experiment.
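
The normalization step can be illustrated with a short sketch. A fixed linear transmission characteristic multiplies the spectrum, so it adds a constant to the log spectrum and therefore a constant offset to each cepstrum coefficient; subtracting the time average of each coefficient over the utterance removes that offset. The function name and the list-of-frames data layout below are assumptions for the example, not taken from Section II-A.

```python
from typing import List, Sequence


def normalize_cepstrum(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract the time average of each cepstrum coefficient.

    frames[t][i] is the i-th cepstrum coefficient of frame t (assumed layout).
    """
    n_frames = len(frames)
    means = [sum(column) / n_frames for column in zip(*frames)]
    return [[c - m for c, m in zip(frame, means)] for frame in frames]


def add_channel(frames: Sequence[Sequence[float]],
                offset: Sequence[float]) -> List[List[float]]:
    """Simulate a fixed channel as a constant additive cepstral offset."""
    return [[c + o for c, o in zip(frame, offset)] for frame in frames]
```

Applying `normalize_cepstrum` to an utterance and to the same utterance passed through `add_channel` gives the same normalized time functions, which is the property exploited in the mixed transmission experiments.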

Utterance sets (2), (3), and (4), each of which comprises 50
utterances by ten male customers each and a single utterance by
40 male impostors, were used in the experiment. They were
recorded over a conventional telephone connection and transmitted
over clear, ADPCM, and LPC vocoder channels, respectively. All of
these utterances were sampled at 6 kHz.

B. Result of Preliminary Experiments 

1) Distance ratio: Ten utterances by five customers each were
used to calculate the ratio of averaged interspeaker distance to
averaged intraspeaker distance for each feature parameter. Of the
ten utterances for each speaker, five were transmitted over the
clear channel and the remaining five were transmitted over the
ADPCM system. The results, which were similar to those obtained
for the utterance set which comprises only ADPCM speech,
indicated that the feature parameters, especially normalized
cepstrum coefficients and the first-order polynomial
coefficients, carry a great amount of individual information
which is not affected by the difference between clear and ADPCM
channels.
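
As a rough illustration of the distance ratio, the sketch below computes the ratio of the average interspeaker distance to the average intraspeaker distance for one feature parameter. It is only a simplified stand-in: each utterance is reduced to a single scalar value and the distance is an absolute difference, whereas the paper compares time functions. The function name and the dictionary layout are assumptions for the example.

```python
from itertools import combinations
from typing import Dict, List


def distance_ratio(values: Dict[str, List[float]]) -> float:
    """Ratio of mean interspeaker distance to mean intraspeaker distance.

    values maps a speaker label to one scalar per utterance (assumed layout).
    """
    # Distances between utterances of the same speaker.
    intra = [abs(a - b)
             for utterances in values.values()
             for a, b in combinations(utterances, 2)]
    # Distances between utterances of different speakers.
    inter = [abs(a - b)
             for (_, first), (_, second) in combinations(values.items(), 2)
             for a in first
             for b in second]
    return (sum(inter) / len(inter)) / (sum(intra) / len(intra))
```

A parameter with a large ratio separates speakers well relative to its within-speaker variation, which is the sense in which the ratio is used for feature selection here.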

2) Comparison between two distance measures: Before starting the
speaker verification experiment which uses different combinations
of the utterance sets, a preliminary experiment was carried out
to compare speaker verification performance using the two
distance measures $D_1$ and $D_2$ described in Section II-D.

In this experiment the reference template for each customer was
constructed by using training utterances transmitted over the
ADPCM system, and test utterances were transmitted over the LPC
vocoder system. The results are given in Table VIII, showing
error rates for the two distance measures. The error rates are
quite similar, although the error rates obtained using $D_2$ are
generally somewhat smaller than those obtained using $D_1$. The
correlation coefficient between the two sets of distances is
0.992. The calculation time for $D_2$ is much smaller than that
for $D_1$. Based on these results, $D_2$ is used hereafter in the
transmission systems experiments.
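
The comparison of the two measures can be sketched as below. The definitions of the measures are given in Section II-D and are not reproduced here; the sketch only assumes, from the labels "(square)" and "(absolute)" in Table VIII, that one measure accumulates squared differences and the other absolute differences. The function names are illustrative.

```python
import math
from typing import Sequence


def squared_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of squared differences (assumed form of the 'square' measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def absolute_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of absolute differences (assumed form of the 'absolute' measure)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def correlation(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation coefficient between two sets of distances."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)
```

Computing both distances for every reference and test pair and passing the two lists to `correlation` gives a coefficient of the kind quoted above. The absolute form avoids one multiplication per term, which is one plausible reason for a difference in calculation time.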
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TABLE VIII 

Average Error Rates. Training Utterances: ADPCM. Test
Utterances: LPC. FR: False Rejection (False Alarm). FA: False
Acceptance (Miss Rate).

Distance      Threshold              FR        FA             (FR + FA)/2
(square)      A Priori Estimated     1.71%     0.70%          1.21%
              A Priori Fixed         2.00%     [illegible]    [illegible]
              A Posteriori Equal     0.?6%
(absolute)    A Priori Estimated     1.14%     1.13%          1.14%
              A Priori Fixed         1.71%     2.08%          1.90%
              A Posteriori Equal     0.12%



TABLE IX

Average Error Rates. FR: False Rejection (False Alarm). FA:
False Acceptance (Miss Rate).

Training   Threshold               Error          Test system (error rate, %)
system                                            Clear    ADPCM    LPC
Clear      A Priori (Estimated)    FR             0.29     1.43     0.86
                                   FA             0.62     0.64     [illegible]
                                   (FR + FA)/2    0.46     1.04     0.80
           A Posteriori Equal                     0.02     0.08     0.02
ADPCM      A Priori (Estimated)    FR             1.43     0.86     1.14
                                   FA             0.66     0.95     1.13
                                   (FR + FA)/2    1.05     0.91     1.14
           A Posteriori Equal                     0.08     0.06     0.12
LPC        A Priori (Estimated)    FR             0.57     1.14     0.29
                                   FA             0.97     0.80     1.3?
                                   (FR + FA)/2    0.77     0.97     0.84
           A Posteriori Equal                     0.04     0.19     0.0?


C. Result of Speaker Verification Experiments

Table IX shows the summary of the results of speaker verification
experiments for nine combinations of transmission systems. When
the reference and test utterances are subjected to different
transmission systems, the error rate is slightly larger than the
error rate which is obtained when all the utterances are subject
to the same transmission system. But even in the worst case,
which is the combination of ADPCM and LPC vocoded speech, the
average error rate by the estimated a priori threshold is only
one percent. This means that the speaker verification method
investigated in this paper suffers little degradation even when
the reference and test utterances are subjected to different
transmission systems.

Fig. 12 shows plots of false rejection and false acceptance for
each transmission system combination as a function of individual
customers. Part (a) shows false rejection rates and part (b)
shows false acceptance rates. The reader should note that the
scales of the two figures are different. A high degree of
variability in scores exists among customers for each pair of
transmission systems. The variability between scores for pairs of
transmission systems is almost negligible compared with the
variability of scores within a pair of transmission systems.

Fig. 12. False rejection (a) and false acceptance (b) as a
function of the training system-testing system pair and customer.
C: clear channel, V: LPC vocoder system, A: ADPCM system.

Table X shows error rates when cepstrum normalization is omitted
for the combination of LPC vocoder and ADPCM channel
transmissions for test and training utterances, respectively. The
distances between reference and test utterances are generally
much greater than those obtained when cepstrum normalization is
applied. Accordingly, the parameter b in the threshold estimation
equation is changed to a value which is appropriate to make the
two kinds of error rates almost the same. The larger error rates
obtained when cepstrum normalization is omitted confirm the
effectiveness of the cepstrum normalization technique.


TABLE X

Average Error Rates With No Cepstrum Normalization.
Training Utterances: ADPCM. Test Utterances: LPC.

Threshold               Error                            Error Rate
A Priori (Estimated)    False Rejection (False Alarm)    1.14%
                        False Acceptance (Miss Rate)     1.30%
                        Average                          1.22%
A Posteriori            Equal                            0.52%

The error rates for the homogeneous conditions in Table IX
are slightly different from the previous results described in
Sections IV-A and IV-D, since the sampling frequency of
"clear" speech is different between the previous and present 
experiments, and the cepstrum normalization technique was 
not applied in the previous experiments. From these com- 
parisons, it is apparent that when the difference of the trans- 
mission characteristics between reference and test utterances 
is small, cepstrum normalization slightly increases the verifica- 
tion error rate by removing the long-term speaker-related 
information. 

In the next section, the effectiveness of the cepstrum 
normalization will be investigated using an utterance set which 
has very large differences between the transmission character- 
istics of reference and test utterances. 

D. Experiments with Artificial Transmission Variation 

The utterance set by ten male customers and 40 male im- 
postors recorded over a conventional telephone connection 
was used in a speaker verification experiment. All utterances 
were passed through a 3 kHz low-pass filter and sampled at 
6.67 kHz. Training utterances were processed with pre- 
emphasis, whereas preemphasis was omitted for test utter- 
ances. This results in a simple but large difference in frequency 
characteristics between training and test utterances. Two ex- 
periments were performed to study the effect of cepstrum 
normalization; verification using normalized cepstrum and 
verification using unnormalized cepstrum. The results are 
shown in Table XI. There are very large differences between 
the results for normalized cepstrum and unnormalized cep- 
strum. It is evident that cepstrum normalization is very power- 
ful, and small error rates can be obtained even when there are 
large frequency characteristic differences between the training 
and test utterances. 



TABLE XI

Average Error Rates. Training Utterances: Processed by
Preemphasis. Test Utterances: Unprocessed by Preemphasis.
FR: False Rejection (False Alarm). FA: False Acceptance
(Miss Rate).
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VI. Discussion 
A. Comparison Between Cepstrum and Log Area Ratio 

In order to check the advantage of the cepstrum coefficients
derived through LPC analysis (LPC-cepstrum), which have been
used in this paper, log area ratio parameters, which are the
arctanh transformation of PARCOR coefficients, were extracted
from utterance set (1) and studied. Log area ratios were found
to be very good parameters for speaker verification in previous
experiments by the author [19]. Fig. 13 shows distance ratios
for each time function of log area ratios and polynomial
coefficients derived from them. The results for cepstrum
coefficients which were extracted from the same utterance set
were shown in Fig. 10. Comparing Figs. 10 and 13, it can be seen
that cepstrum coefficients are more efficient than log area
ratios for speaker verification.
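
The arctanh transformation mentioned above can be written in a few lines. This is a minimal sketch with illustrative function names; note that the log area ratio is often defined as log((1 + k)/(1 - k)), which equals twice the arctanh of k, so the scale factor and the sign convention of the PARCOR coefficient k may differ from those used for the experiments.

```python
import math
from typing import List, Sequence


def log_area_ratios(parcor: Sequence[float]) -> List[float]:
    """arctanh transformation of PARCOR coefficients (each |k| < 1)."""
    return [math.atanh(k) for k in parcor]


def parcor_from_log_area_ratios(ratios: Sequence[float]) -> List[float]:
    """Inverse transformation back to PARCOR coefficients."""
    return [math.tanh(g) for g in ratios]
```

The transformation expands the range near |k| = 1, where the spectrum is most sensitive to small changes in k.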

Table XII shows the results of a speaker verification experiment
using log area ratios and polynomial coefficients derived from
them compared to the results using cepstrum coefficients. In this
experiment, 10 utterances by 10 customers each and a single
utterance by 40 impostors were used. The first three utterances
were used to construct a reference template for each customer,
and the remaining seven customer utterances and impostor
utterances were used as test utterances. A constrained endpoint
dynamic time warping technique was used in this experiment. The
error rate results show that cepstrum coefficients have an
advantage over log area ratios.

Fig. 14 shows examples of spectral envelopes derived from
10 cepstrum coefficients or 10 log area ratios for a spoken
sentence "We were away a year ago." Log area ratios are
transformed into linear predictor coefficients and the spectral
envelope is computed using the correlation function of the
coefficients. Time sequences of the envelope for the first 100
frames are shown in these figures. The frame interval is 10 ms.
Spectral envelopes derived from cepstrum coefficients are much
smoother than those derived from log area ratios along both the
frequency axis and the time axis. In other words, the spectral
envelope sequence obtained using cepstrum coefficients is more
stable than that obtained using log area ratios.

Fig. 13. Interspeaker to intraspeaker distance ratio for each
time function of log area ratios and polynomial coefficients
derived from them. Ten utterances by five male speakers each
were used for the analysis.


TABLE XII

Error Rates. Utterance Set: Subset of the Utterance Set (1).
Threshold: A Posteriori Equal Error Threshold. Time
Registration: Constrained Endpoint Dynamic Time Warping.
Number of Training Utterances: 3.

Feature Parameters    Error Rate
Cepstrum              0.80%
Log Area Ratio        1.59%

Fig. 15 shows comparisons of four kinds of spectra: short time
speech spectrum, spectral envelope derived from log area ratio,
spectral envelope derived from LPC-cepstrum, and spectral
envelope derived from conventional cepstrum coefficients which
are extracted through Fourier transformation of the log power
spectrum (FFT-cepstrum). Results for two frames in the sentence
are presented in the figure. It can be seen that spectral
envelopes derived from LPC-cepstrum and FFT-cepstrum are quite
similar and are much smoother than those derived directly from
LPC parameters, which are very sensitive to spectral peaks.

B. Comparison Between LPC-Cepstrum and FFT-Cepstrum 

As indicated in Fig. 15, a spectral envelope derived from the
LPC-cepstrum is slightly different from a spectral envelope
derived from the FFT-cepstrum. In order to study the effect of
this difference on speaker verification, several experiments
were performed using utterance set (6), which consists of female
utterances. The size of the time window was set to 256 samples
(38.4 ms) to extract the FFT-cepstrum, while the window size used
to extract the LPC-cepstrum was 30 ms. The FFT-cepstrum
computation time, which includes two Fourier transformations, is
almost twice that of the LPC-cepstrum.

Fig. 14. Spectral envelopes derived from ten cepstrum
coefficients (a) or ten log area ratios (b) for a spoken
sentence, "We were away a year ago." Time sequences of the
envelope for the first 100 frames (1 s long) are shown.

The distance ratio for each parameter derived from the
FFT-cepstrum was calculated. The result was similar to that for
the LPC-cepstrum except that the speaker dependent information in
the FFT-cepstrum tends to concentrate in the first-order
cepstrum. Overall average distance ratios for cepstrum
coefficients, the first-order polynomial coefficients, and the
second-order polynomial coefficients are 2.15, 1.89, and 1.45,
respectively, for LPC-cepstrum, and 2.07, 1.83, and 1.37,
respectively, for FFT-cepstrum. LPC-cepstrum has slightly larger
values than FFT-cepstrum, but the difference is very small.

Table XIII shows the results of a speaker verification experiment
using FFT-cepstrum under the same condition as that using
LPC-cepstrum whose results were presented in Table III. The
difference in error rates between these two experiments is very
small. Speaker verification results using FFT-cepstrum when the
interval between training and test utterances was long were
similar to the results obtained using LPC-cepstrum.

Fig. 15. Comparison of four kinds of spectra: short time speech
spectrum, spectral envelope derived from log area ratio, spectral
envelope derived from LPC-cepstrum, and spectral envelope derived
from FFT-cepstrum. (a) For the sound /i/ in "We." (b) For the
sound /o/


TABLE XIII

Average Error Rates. Utterance Set: No. (6) (FFT-Cepstrum).
FR: False Rejection (False Alarm). FA: False Acceptance
(Miss Rate).

Threshold                 FR        FA        (FR + FA)/2
A Priori   Estimated      0.29%     0.33%     0.31%
A Priori   Fixed          0.86%     0.74%     0.80%
A Posteriori              0.02%


It is apparent that the LPC-cepstrum produces almost the same
results in speaker verification as the conventional FFT-cepstrum,
while it takes only half the time to calculate the LPC-cepstrum
compared with the FFT-cepstrum.
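
One reason the LPC-cepstrum is inexpensive is that it can be obtained from the predictor coefficients by a short recursion, without any Fourier transformation. The sketch below uses the standard recursion for an all-pole model; it is given for illustration and is not taken from this paper. It assumes the model $H(z) = 1/(1 - \sum_k a_k z^{-k})$, so predictor coefficients defined with the opposite sign must be negated first, and it ignores the gain term $c_0$.

```python
from typing import List, Sequence


def lpc_to_cepstrum(a: Sequence[float], n_ceps: int) -> List[float]:
    """Cepstrum c_1..c_n of H(z) = 1 / (1 - sum_k a_k z^-k).

    a[k - 1] holds a_k. Uses c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k},
    with a_n taken as zero for n greater than the predictor order.
    """
    order = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] (gain term) is left unused
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= order else 0.0
        for k in range(1, n):
            if n - k <= order:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

For a single real pole at radius r the recursion reproduces the closed form $c_n = r^n / n$, which gives a simple way to check an implementation.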

C. Effectiveness of Polynomial Coefficients

In order to study the effectiveness of the use of polynomial
coefficients on speaker verification, an additional experiment
was performed in which the polynomial coefficients were omitted,
using only the time functions of the first to tenth cepstrum
coefficients. The training and test utterance recording
conditions are the same as in the experiment described in
Section V-D, where preemphasis is applied to the training
utterances but omitted for the test utterances. Cepstrum
normalization was applied to all utterances. Table XIV shows the
verification error rates, including previous results which were
obtained when polynomial coefficients are included. It can be
seen that error rates are increased by a factor of three or more
when polynomial coefficients are omitted.


TABLE XIV

Average Error Rates. Training Utterances: Processed by
Preemphasis. Test Utterances: Unprocessed by Preemphasis.
FR: False Rejection (False Alarm). FA: False Acceptance
(Miss Rate).

D. Optimum Length of Speech Segment for 
Polynomial Expansion 

In the experiments described so far, the length of the speech
segment over which the time functions of cepstrum coefficients
are expanded by an orthogonal polynomial representation has been
set to 90 ms. This value was determined to be adequate for
[illegible] transitional information between phonemes. In order
to check the appropriateness of this value of length, additional
speaker verification experiments were performed varying the
length between 50 ms and 210 ms. Training utterances were
processed with preemphasis, whereas test utterances were not,
and cepstrum normalization was applied to all utterances. The
result for the experiment in the previous section corresponds to
the condition of zero length segment. The verification error
rates with a priori and a posteriori thresholds are plotted in
Fig. 16 as a function of segment length, including the results of
[illegible]. The averages of the false acceptance and false
rejection rates are plotted. The verification error rate has a
minimum for 170 ms, and the error rate increases both for shorter
and longer lengths. The difference between [illegible] the
optimum value [illegible], and the number of computations
increases in proportion to the segment length. Based on these
considerations, it can be concluded that 90 ms is a reasonable
value for polynomial expansion in this speaker verification
system.

Fig. 16. Error rate versus the length of the speech segment for
orthogonal polynomial expansion. Training utterances were
processed with preemphasis, whereas test utterances were not.


TABLE XV

Average Error Rates. Utterance Set: No. (1). Time Registration:
Constrained Endpoint Dynamic Time Warping.

Threshold                 FR        FA        (FR + FA)/2
A Priori   Estimated      0.40%     0.43%     0.42%
A Priori   Fixed          0.57%     0.58%     0.58%
A Posteriori              0.04%
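
The polynomial expansion over a short segment, discussed in Section VI-D, can be sketched as a sliding least-squares fit. With the 10 ms frame interval quoted earlier, a 90 ms segment corresponds to nine frames, that is, a half width of four frames. The function name, the normalization of the orthogonal polynomials, and the handling of the utterance edges are assumptions for the example and may differ from the expansion defined in the earlier sections.

```python
from typing import List, Sequence, Tuple


def polynomial_coefficients(track: Sequence[float],
                            half_width: int = 4) -> List[Tuple[float, float]]:
    """First- and second-order coefficients of one cepstrum time function.

    For each frame with a full window of 2 * half_width + 1 frames, project
    the window onto P1(j) = j and P2(j) = j^2 - mean(j^2), which are
    orthogonal to each other and to a constant over j = -half_width..half_width.
    """
    offsets = list(range(-half_width, half_width + 1))
    mean_square = sum(j * j for j in offsets) / len(offsets)
    p1 = [float(j) for j in offsets]
    p2 = [j * j - mean_square for j in offsets]
    norm1 = sum(v * v for v in p1)
    norm2 = sum(v * v for v in p2)
    result = []
    for t in range(half_width, len(track) - half_width):
        window = track[t - half_width:t + half_width + 1]
        slope = sum(p * x for p, x in zip(p1, window)) / norm1
        curvature = sum(p * x for p, x in zip(p2, window)) / norm2
        result.append((slope, curvature))
    return result
```

The first value is the local slope of the time function and the second is the coefficient of its quadratic term. A longer window averages over more frames but costs proportionally more computation, which is the tradeoff considered above.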

E. Comparison Between Unconstrained Endpoint and
Constrained Endpoint Dynamic Time Warping Methods

Table XV shows speaker verification results when constrained
endpoint dynamic time warping is used. Other conditions are the
same as in the experiment whose results were shown in [table
reference illegible]. The error rate using the constrained
endpoint method is twice that using the unconstrained endpoint
method. This result shows the advantage of the unconstrained
endpoint dynamic time warping method, which [illegible] the
location of both the initial and final frames due to breath
noise, etc., over the constrained endpoint method.



F. Effectiveness of Dynamic Time Warping Guided by
Shorter One of Either Input or Reference

It was stated earlier that optimum matches are obtained by using
as guide the shorter of either the input or reference contours.
This warping procedure was used in all the speaker verification
experiments described in this paper. To show the effect of not
observing this procedure, an additional experiment was performed.
Utterance set (5) was used in a speaker verification experiment
in which the input utterance exclusively was used as the guide.
This is referred to as the UEGI (unconstrained endpoint guided by
input) method, in contrast to the UEGS method [unconstrained
endpoint guided by (the) shorter (of reference and input)]
adopted in all other experiments.

A reference template of each customer was constructed 
using five utterances and updated at every access by the cus- 
tomer for the first seven test utterances (method 2) and up- 
dated every seventh access by the customer for the remaining 
test utterances (method 1). Decision thresholds were set 
a priori based on (12). 

Mean error rates for this utterance set using the UEGI pro- 
cedure are plotted in Fig. 17 along with the results obtained 
earlier using the UEGS procedure. It can be seen that, al- 
though false acceptance error rates are comparable for the two 
techniques, false rejection rates for the first and middle seven 
input utterances are much greater for the UEGI procedure. 
This outcome may be attributed to the fact that until stable 
references are established by updating, the lengths of reference 
and input utterances are quite variable. Therefore, large dis- 
crepancies can be expected between the UEGI and UEGS pro- 
cedures. However, with stable references associated with the 
last seven input utterances the lengths of input and reference 
utterances are more consistent and little or no discrepancy is 
expected between the two techniques. 

When warping is guided by the input utterance, the first
frame of the input may be warped to the first through
$(\delta + 1)$th frame of the reference, where $\delta$
specifies the width of the allowable range. Similarly, when
warping is guided by the reference, the first frame of the
reference may be warped to the first through $(\delta + 1)$th
frame of the input.
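
The role of the allowable range can be illustrated with a small dynamic programming sketch. This is not the warping algorithm of this paper: the local path constraints, the absolute-difference frame distance on scalar frames, the symmetric treatment of the final frames, and the function names are all assumptions made for the example. Setting `delta` to zero gives the constrained endpoint case.

```python
from typing import Sequence


def dtw_distance(guide: Sequence[float], slave: Sequence[float],
                 delta: int = 0) -> float:
    """Accumulated distance with relaxed endpoints on the slave axis.

    The first guide frame may match slave frames 0..delta, and the last
    guide frame may match any of the last delta + 1 slave frames.
    """
    n, m = len(guide), len(slave)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = abs(guide[i] - slave[j])
            if i == 0 and j <= delta:
                best = 0.0  # the path is allowed to start here
            else:
                best = min(acc[i - 1][j] if i > 0 else inf,
                           acc[i][j - 1] if j > 0 else inf,
                           acc[i - 1][j - 1] if i > 0 and j > 0 else inf)
            acc[i][j] = cost + best
    first_allowed_end = max(0, m - 1 - delta)
    return min(acc[n - 1][first_allowed_end:])


def guided_by_shorter(input_frames: Sequence[float],
                      reference: Sequence[float], delta: int = 0) -> float:
    """Use the shorter of input and reference as the guide (UEGS-style)."""
    if len(input_frames) <= len(reference):
        return dtw_distance(input_frames, reference, delta)
    return dtw_distance(reference, input_frames, delta)
```

In this toy setting, an input with one extra frame at each end (for example breath noise) matches its reference with zero distance when `delta` is 1, whereas the constrained case is forced to pay for both spurious frames.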

An experiment was carried out using 26 utterances from 
each of the 21 male customers in utterance set (5). Each cus- 
tomer's utterances were paired with the customer's reference 
and matched two ways, using the input utterance as the guide 
and the reference utterance as the guide. For each such pair, 
the slave frame matched to the first guide frame for the better 
of the two matches (the match resulting in the lower overall 
distance) was tabulated. This tabulation is presented in the 
histograms of Fig. 18. 

Along the abscissa is plotted the slave frame minus one 
(matched to guide frame number one) with the input as slave 
plotted along the positive axis and the reference as slave 
plotted along the negative axis. Equivalently, the positive axis 
represents matches in which the reference is guide while the 
negative axis represents matches in which the input is guide. 
Each histogram point represents the number of optimum 
matches corresponding to the indicated slave frame. The re- 
gion enclosed by the shaded vertical bars represents optimum 
matches which can be obtained by using either the reference 
as guide or input as guide. For example, optimum matches
within the shaded region to the left of the y-axis, which are
actually obtained using the input as guide, are substantially the
same when guided by the reference, matching the first reference
frame to the first input frame. Thus, from Fig. 18(a) all
but 9.5 percent of the optimum matches are obtained by using
the reference as guide, while all but 5.7 percent are obtained
by using the input as guide.

Fig. 18(b) and (c) decompose the matches of Fig. 18(a) into two
categories. In Fig. 18(b) all the optimum matches are shown for
which the input length is less than or equal to the reference
length, while in Fig. 18(c) are shown all the optimum matches for
which the reference length is less than the input length. It can
be seen immediately that the greatest number of optimum matches
is associated with using as guide the shorter of the input and
reference. That is, in Fig. 18(b) all but 2.6 percent of the
optimum matches are obtained by using the input as guide, while
in Fig. 18(c) all but 3.7 percent of the optimum matches are
obtained by using the reference as guide.

Fig. 17. Comparison of error rates for two dynamic time warping
techniques: UEGI and UEGS. Results for utterance set (5).

Fig. 18. Histograms for the starting frame of the optimum warping
function, where a positive number means that the warping function
starts from the input axis and a negative number means that it
starts from the reference axis. N: number of frames of reference
function, [symbol illegible]: number of frames of input function.
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TABLE XVI 

Average Error Rates. Utterance Set: No. (1). Time Registration:
Constrained Endpoint Dynamic Time Warping. Number of
Training Utterances: 3. FR: False Rejection (False Alarm).
FA: False Acceptance (Miss Rate).
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G Effect of the Number of Training Utterances 

Table XVI shows the results of a speaker verification experiment
in which three utterances were used to construct a reference
template. Other conditions are the same as in the experiment
whose result was shown in Table XV (Section VI-E), which means
that the constrained endpoint dynamic time warping method was
used. Comparing Tables XV and XVI, it can be seen that using
three utterances to make a reference template is not adequate.
The error rate becomes almost twice that obtained when five
training utterances are used.

Next, the number of training utterances was increased to ten,
and a speaker verification experiment was performed. However, it
produced no improvement compared with the results using five
training samples to construct a reference template. It can be
concluded that five utterances are necessary and sufficient to
make a reference template.

H. Threshold Estimation 

In this paper, (12) is used to set an a priori decision
threshold for each customer. This equation and the two parameters
in it were determined experimentally. Fig. 19 shows the relation
between $\mu_{DB}(k) - \sigma_{DB}(k)$ and the equal error
threshold $\theta_{eq}(k)$, which produces equal rates of false
acceptance and false rejection. This is the result of the speaker
verification experiment using utterance set (3) which produced
the error rates shown in Table VI. The correlation coefficient
between $\mu_{DB}(k) - \sigma_{DB}(k)$ and $\theta_{eq}(k)$
calculated from these values is 0.753. This result indicates the
appropriateness of using (12) to estimate the optimum decision
threshold.

To determine the effect of varying the parameter b in (12)
on the error rate, all the customer utterances and impostor
utterances which were tested in the speaker verification
experiment using the mixed transmission system (Section V)
were scanned by varying the parameter b. The parameter a
was set to 0.6, which was determined experimentally. The
number of errors was tabulated at each step by comparing the
actual overall distances with the threshold estimated using
the varied parameter value b. False acceptance rate and false
rejection rate were averaged and plotted in Fig. 20. Results
for nine conditions, which are nine combinations of three
training systems and three testing systems, are shown. As a
result of the increase in false acceptance rate for large
threshold values and the increase in false rejection rate for
small threshold values, the averaged error rate has a concave
slope as a function of b. The optimum value of the parameter b,
which produces the minimum average error rate, is seven almost
irrespective of the experimental condition. This is the value
consistently used in this paper to estimate an optimum threshold
for each customer. As there is some tradeoff between the two
kinds of error rates, if it is desirable to keep the false
acceptance rate at a much lower value, the parameter b should be
set at a value smaller than seven, even though it increases the
false rejection rate. Conversely, values of b larger than seven
produce a smaller false rejection rate and a larger false
acceptance rate. The dashed line in Fig. 19 indicates the relation

$$\theta_{eq} = 0.6\,(\mu_{DB} - \sigma_{DB}) + 7. \tag{13}$$

Fig. 21. Error rate versus decision threshold. The effects of threshold
variation on the average of false rejection and false acceptance error
rates are shown using the threshold. Results for nine training
system-testing system pairs are plotted.

Fig. 21 shows verification error rate, which is the mean value 
of false acceptance rate and false rejection rate, as a function 
of decision threshold for the same experimental conditions. 
Results for the nine conditions are plotted. The reader should 
note that the scale of this figure is different from the previous 
one. This result shows that the optimum value of the thresh- 
old varies considerably depending on the utterance set. Thus, 
it is very difficult to set the threshold in advance independently 
of the utterance set. 

These results indicate the effectiveness of the threshold
estimating method using (12).
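
The scan over the parameter b described in this section can be sketched as follows. The linear form of the threshold, a times (mean minus standard deviation) plus b, is assumed here from relation (13); the record layout, the function names, and the sample values are hypothetical, and acceptance is assumed to mean that the overall distance is below the threshold.

```python
def estimate_threshold(mu_db, sigma_db, a=0.6, b=7.0):
    """A priori threshold, assuming the linear form suggested by (13)."""
    return a * (mu_db - sigma_db) + b


def average_error(customers, b, a=0.6):
    """Mean of the pooled false rejection and false acceptance rates."""
    fr_count = fa_count = n_customer = n_impostor = 0
    for c in customers:
        theta = estimate_threshold(c["mu_db"], c["sigma_db"], a, b)
        fr_count += sum(d >= theta for d in c["customer_d"])
        fa_count += sum(d < theta for d in c["impostor_d"])
        n_customer += len(c["customer_d"])
        n_impostor += len(c["impostor_d"])
    return 0.5 * (fr_count / n_customer + fa_count / n_impostor)


def best_b(customers, b_values, a=0.6):
    """Return the scanned value of b giving the lowest average error."""
    return min(b_values, key=lambda b: average_error(customers, b, a))


# Hypothetical record for a single customer.
customers = [{
    "mu_db": 20.0, "sigma_db": 5.0,
    "customer_d": [12.0, 14.0, 15.0],
    "impostor_d": [17.0, 19.0, 22.0],
}]
print(best_b(customers, [0.0, 3.0, 7.0, 10.0, 14.0]))
```

Lowering b below the returned value trades a higher false rejection rate for a lower false acceptance rate, in line with the tradeoff noted in the text.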

I. Withholding Decision

Another tabulation was carried out to assess the effect on
error rate of withholding decision (sequential decision) on
trials for which $|D_T - \theta| < \Delta$, where $D_T$ and
$\theta$ are the overall distance and threshold, respectively.
When the decision is withheld on a given utterance, a new
distance is calculated which is the mean of the distances of
the withheld utterances and the succeeding utterance.
Utterance sets (1), (3), (5), and (6) were used in this
experiment. As these utterance sets include only one utterance
for each impostor, impostor utterances were not used in this
experiment.

Fig. 22 shows error rates as a function of the withholding
threshold $\Delta$. Part (a) shows the results when reference
templates were updated every seventh trial for each customer, and
part (b) shows averaged error rates over several conditions
when the interval between training and test utterances was
varied from six days to six weeks. Fig. 23 shows the percentage
of withheld trials, which is the percentage of additional trials,
as a function of the withholding threshold $\Delta$. Fig. 24
shows the relation between the percentage of withheld trials and
the error rate normalized by the error rate obtained without
withholding. Part (a) and part (b) correspond to part (a) and
part (b) in Fig. 22, respectively. Fig. 24 indicates that at
least 30 percent improvement in error rate can be obtained with
decisions withheld on five percent of the trials, and an average
73 percent improvement can be obtained with decisions withheld
on ten percent of the trials.

Fig. 22. Error rate versus withholding threshold. (a) False rejection
and false acceptance rates on the condition that a reference template
is updated every seventh access by each customer. Mean time interval
between training and test utterances is six days. (b) False rejection
and false acceptance rates averaged over several conditions
when the time interval between reference and test utterances is
varied from six days to six weeks.

Fig. 23. Percentage of withheld trials versus withholding threshold.
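
The withholding rule of this section can be sketched as below. This is one reading of the description, in which the running mean of all distances obtained so far is compared with the threshold after each utterance; the function name, the return format, and the numerical values are hypothetical, and acceptance is assumed to mean a distance below the threshold.

```python
def sequential_decision(distances, theta, delta):
    """Decide on a sequence of overall distances from successive utterances.

    The decision is withheld while the running mean of the distances lies
    within delta of the threshold theta. Returns (decision, utterances_used),
    where decision is "accept", "reject", or "withheld" if the utterances
    run out before a decision is reached.
    """
    total = 0.0
    for n, d in enumerate(distances, start=1):
        total += d
        mean_d = total / n
        if abs(mean_d - theta) >= delta:
            decision = "accept" if mean_d < theta else "reject"
            return decision, n
    return "withheld", len(distances)


# The first utterance is too close to the threshold; the second settles it.
print(sequential_decision([16.5, 13.5], theta=16.0, delta=1.0))
```

Setting delta to zero reduces the rule to a single decision on the first utterance, which corresponds to the case without withholding.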

J. Combination with Pitch and Intensity Contours

Speaker verification systems based on pitch and intensity
contours have been evaluated in Bell Laboratories using the
same utterance sets used in this paper [1]-[5]. The information
conveyed by pitch and intensity contours is considered to be
almost independent of that conveyed by the cepstrum. A
combination of these two kinds of information will improve the
performance.

In order to test the independence of these two kinds of
information, the distribution of speaker verification error rates
by cepstrum and that by pitch and intensity contours were
compared with each other and a correlation coefficient between
them was calculated. When all the error rates plotted in Fig. 12
and the error rates obtained with the same conditions using pitch
and intensity contours are used, the correlation coefficient is
0.22 and -0.36 for false rejection and false acceptance,
respectively. It can be concluded that the two kinds of
information are fairly independent. Based on these results, an
improvement in performance can be expected by combining these
two kinds of information.
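
The independence check above rests on a correlation coefficient between two sets of error rates. A minimal sketch of that computation follows, assuming the ordinary Pearson product-moment coefficient (the text does not name the estimator); the function name and the sample values are hypothetical and are not the error rates of Fig. 12.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-condition error rates (percent) for the two systems.
cepstrum_err = [0.2, 0.4, 0.8, 0.3]
pitch_err = [1.5, 1.1, 1.9, 2.4]
print(round(pearson(cepstrum_err, pitch_err), 3))
```

A coefficient near zero suggests that the errors of the two systems tend to fall on different conditions, which is the situation in which combining them is most likely to help.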

VII. Summary 

A new system for automatic speaker verification has been 
implemented on a 16-bit laboratory computer and evaluated. 
A fixed, sentence-long utterance is analyzed by cepstrum co- 
efficients by means of LPC analysis. Frequency-response dis- 
tortions introduced by transmission systems are removed 
automatically. Time functions of cepstrum coefficients are 
expanded by orthogonal polynomial representations and com- 
pared with stored reference functions. After dynamic time 
warping, a decision is made to accept or reject an identity 
claim. Reference functions and decision thresholds are up- 
dated for each customer. The total processing time is approxi- 
mately 40 times real time in this computer simulation. 

In the first part of the experiment, three sets of utterances
were used for the evaluation of the system. The first and
second sets each comprise 50 utterances by ten customers
each and a single utterance by 40 impostors recorded over a
conventional telephone connection. The third set comprises
26 utterances by 21 customers each and a single utterance by
55 impostors recorded over a high-quality microphone. The
first and third sets were uttered by male speakers, whereas the
second set was uttered by female speakers. The evaluation
indicated mean error rates of 0.19 percent, 0.36 percent, and
0.77 percent for each utterance set, respectively.

Second, the first utterance set was processed by an ADPCM 
coding system and an LPC coding system. These utterance 
sets were used for a speaker verification experiment together 
with an unprocessed utterance set. Experimental results indi- 
cate that the transmission system affects the verification ac- 
curacy only slightly even if the reference and test utterances 
are subjected to different transmission conditions. 

Third, the time interval between reference and test utter- 
ances was changed from six days to six weeks. Results of the 
experiment indicate no significant increase of verification error 
with the increase of time interval. 

These results verify the robustness of the new speaker verifi- 
cation system presented in this paper. Some discussions on 
new techniques used in this system are also included in this 
paper. 

Further investigations, current or projected, include a large- 
scale and long-term evaluation over telephone lines permitting 
direct customer access and on-line response, and specialized 
hardware processing to improve response time. 